Skip to content

Close out the audit: mode refactor, ownership tokens, process identity, tmux spawning, extra_ports, docker e2e#25

Merged
hefgi merged 7 commits into
mainfrom
claude/magical-einstein-7m4g40-modes-refactor
Jun 11, 2026
Merged

Close out the audit: mode refactor, ownership tokens, process identity, tmux spawning, extra_ports, docker e2e#25
hefgi merged 7 commits into
mainfrom
claude/magical-einstein-7m4g40-modes-refactor

Conversation

@hefgi

@hefgi hefgi commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Fixes #10
Fixes #12
Fixes #13
Fixes #14
Fixes #16

Closes out every remaining open issue plus the five findings from the post-merge audit, as five reviewable commits:

1. refactor(modes)#12

ModeHandler::bring_up takes a BringUpRequest (was 14 positional params under a silenced clippy lint, now zero allows in the tree). The duplicated docker-startup/worktree/port-allocation/spawn blocks move to shared building blocks in modes/mod.rs; container/hybrid keep only their genuine differences (whole-file vs label-split fallbacks). Net −117 lines, no behavior change.

2. fix(state) — audit findings 1–5

  • Ownership tokens (PendingOp {id, since}): every finalize/restore path writes state only while it still owns the entry. A down/flush racing a live up now makes the loser tear down its own resources and exit non-zero instead of resurrecting a deleted session. sync refuses mid-operation sessions.
  • The branch-name prompt runs with no lock held (was: shared lock, which still blocks exclusive acquirers).
  • ls warns about sessions pending >15 min (crashed command ⇒ leaked slot) with the down <slug> recovery hint.
  • The TERM→KILL escalation is one helper (signal_with_grace), not two copies; missing pid files produce a warning instead of silence.

3. fix(process,sync)#10 + #13

  • PID files carry the process start token (ps lstart); recycled PIDs are never signaled, never attributed by whose-pid, and report as down. Legacy files still parse.
  • Containers matched by compose project label (exact), never name substrings; match_services keeps a claimed-PID set; resume/status health is identity-based (no more depth-1 lsof scan → no more duplicate spawns).
  • tmux services run as their window's process — no send-keys typing. Pane PIDs are recorded as token-verified pid files; a command dying within 1.5 s fails up with its exit status and last output (verified by tmux-gated tests, run against tmux 3.4); dead panes are kept for inspection; window shell stays interactive. Env sourcing uses POSIX ..
  • Integration tests now isolate HOME — they were silently reading the developer's real ~/.config/ecluse/config.toml (this surfaced as real failures the moment tmux behavior changed).

4. fix(config)#14

extra_ports use the same base + slot*stride rule as primaries (unchanged for the default stride 1), and are probed up front: occupied → hard error under strict_port, warning otherwise (they don't auto-bump, so the alternative was a raw bind failure). New publish_primary field replaces the implicit "extra_port with container_port suppresses the primary" rule, which validate now flags as deprecated.

5. test(e2e) + docs — #16 + remaining process debt

  • tests/docker_e2e.rs: hybrid lifecycle, rollback leaves no containers/worktree/state, the hyphenated-slug multi-compose teardown (Overlay→compose mapping is reconstructed by filename parsing at teardown — ambiguous with hyphenated slugs; store the pairs in state #9's scenario), container mode — all verified against a live daemon here; they skip cleanly where docker is absent (macOS runners). Images via the ECR public mirror (Docker Hub rate limits hit shared CI IPs).
  • SKILL.md + docs cover the new agent-visible behavior (pending sessions, recovery, identity-based kills, tmux failure detection, --force ownership warnings).
  • All examples/ and skills/examples/ migrate off deprecated on_up/on_down (their migration hooks need session env — now post_up/pre_down).
  • CHANGELOG gains the [Unreleased] section covering everything since 0.2.17.

Verification

Locally: cargo fmt --check, cargo clippy -- -D warnings, 428 unit + 25 integration + 4 docker-e2e + 2 tmux-gated tests, all green, plus live-daemon docker runs. CI gates the merge.

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko


Generated by Claude Code

claude added 7 commits June 11, 2026 14:35
ModeHandler::bring_up took 14 positional parameters under a silenced
too_many_arguments lint, and container/hybrid duplicated ~100 lines of
docker startup (port maps, overlay generation, compose up, rollback
registration, pair recording) that had already drifted apart once.

bring_up now takes a BringUpRequest plus config/root/log. The shared
logic lives in modes/mod.rs: start_docker_services (group loop, scoped
or whole-file), ensure_worktree, native_ports_for_slot (filter-aware;
hybrid honors --services, host keeps its historical allocate-all
behavior), and spawn_native_services. The whole-compose-file and
labeled-data fallbacks stay mode-private. No behavior change; the
clippy allow is gone.

Fixes #12

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
…th hardening

Audit follow-ups to the pending-session work:

Every operation that marks a session Pending now records a PendingOp
{id, since}; finalize paths write state only while they still own the
entry (State::still_owned). A finalize that lost ownership — the session
was removed or taken over by a concurrent down/shutdown/flush — stands
down and tears down the resources it created instead of resurrecting a
deleted entry. restore_session is a no-op for lost ownership; sync
refuses sessions that are mid-operation; ls warns about pending entries
older than 15 minutes (slot leak after a crashed command) with the
recovery hint.

The branch-name prompt in cmd_up now runs on a cloned snapshot with no
lock held — even a shared lock blocks exclusive acquirers, so a human at
the prompt stalled every concurrent up/down for their 10s timeout.

The TERM->KILL escalation existed twice (terminate_with_grace in main,
kill_process_group in process); both now share signal_with_grace.
check_processes_alive warns when a nohup pid file is missing entirely
instead of staying silent about a killed service.

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
… as real panes

Service identification stops relying on fuzzy matching:

PID files now record the process start token (ps lstart) next to the
pid. Readers treat a live PID with a mismatched token as recycled:
kill paths refuse to signal it (and its group), whose-pid refuses to
attribute it, and health checks report the service as down. Legacy
single-line pid files still parse.

Containers are matched by the compose project label (exact) instead of
name substrings, so session feat-a can no longer adopt feat-ab's
containers or hand redis postgres's port. match_services keeps a
claimed-PID set so two services with overlapping command tokens cannot
both own one process. Resume health checks and status now use the
session's own identity (token-verified pid files, tmux windows) instead
of a depth-1 lsof scan that missed servers with a cwd in a
subdirectory and then spawned duplicates.

tmux services are spawned as the window's own process (no send-keys
typing): pane PIDs are recorded as token-verified pid files, dead panes
are kept via a session-scoped remain-on-exit hook, and a command that
dies within the 1.5s grace window fails the spawn with its exit status
and last output instead of reporting a ready session. Window 0 stays a
plain shell for interactive attach. Env sourcing uses POSIX '.' so it
works under dash.

Fixes #10
Fixes #13

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
…obing; explicit publish_primary

extra_ports were computed as base + slot everywhere while primary ports
use base + slot*stride — with slot_stride configured, secondary ports
ignored the spacing the stride exists to provide. They now share the
primary spacing rule (ServiceConfig::extra_port_for_slot), in the env,
the compose overlay, and the compose interpolation vars. build_env takes
the stride and a single merged service-config slice.

Extra ports are deterministic (no auto-bump), so an occupied one used to
surface only as a raw bind failure from the service or container.
bring_up now probes them up front: strict_port makes it a hard error
with the holding PID; otherwise it warns before starting. Skipped
(already-running) services are exempt — they hold their own ports.

The implicit rule "an extra_port with container_port suppresses the
primary publish" gains an explicit escape hatch: publish_primary on the
service wins when set, and ecluse validate warns when the implicit rule
is being relied on (deprecated).

Fixes #14

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
…eprecated hooks, changelog staged

tests/docker_e2e.rs runs real docker compose lifecycles where a daemon is
available (CI ubuntu, dev machines) and skips elsewhere: hybrid up/down,
rollback after a failed post_up leaves no containers/worktree/state, the
hyphenated-slug multi-compose teardown that motivated compose_overlays,
and container mode whole-file startup. Images come from the ECR public
mirror to dodge Docker Hub's unauthenticated rate limits on shared CI
IPs.

SKILL.md and docs/limits.md document the agent-visible behavior added in
this wave (pending sessions and their errors, stale-pending recovery,
identity-based kills, tmux instant-failure detection, --force ownership
warnings). All examples migrate off the deprecated on_up/on_down hook
names — the migration-style hooks they demonstrate need session env, so
they now use post_up/pre_down. CHANGELOG gains the Unreleased section
covering everything since 0.2.17, as the release process requires.

Fixes #16

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
skip_serializing_if dropped the empty vec on save while the missing
field deserialized to the default — an explicit opt-out silently came
back as ['.env', '.env.local'] after any config rewrite. inherit_env is
now always serialized.

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
Four tests pulling the image in parallel tripped the registry's per-IP
rate limit on shared CI runners (even on the ECR mirror). The first
caller pulls with retries behind a OnceLock; an unobtainable image skips
the suite like an absent daemon instead of failing on registry weather.

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
@hefgi hefgi merged commit 773df03 into main Jun 11, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment